Chinese Lipreading Network Based on Vision Transformer
XUE Feng1, HONG Zikun2, LI Shujie1, LI Yu2, XIE Yincen2
1. School of Software, Hefei University of Technology, Hefei 230601; 2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601
Abstract: Lipreading is a multimodal task that converts videos of lip movements into text, with the goal of understanding what a speaker says in the absence of sound. Existing lipreading methods adopt convolutional neural networks to extract visual features of the lips, but these networks capture only short-distance pixel relationships, making it difficult to distinguish the lip shapes of similarly pronounced characters. To capture long-distance relationships between pixels in the lip region of video frames, an end-to-end Chinese sentence-level lipreading model based on the vision transformer (ViT) is proposed. By fusing ViT with the gated recurrent unit (GRU), the model's ability to extract visual spatio-temporal features from lip videos is improved. Firstly, the global spatial features of each lip image are extracted by the self-attention modules of ViT. Then, a GRU models the temporal sequence of frames. Finally, a cascaded attention-based sequence-to-sequence model predicts Chinese pinyin and then Chinese character sequences. Experimental results on the Chinese lipreading dataset CMLR show that the proposed model achieves a lower Chinese character error rate than existing methods.
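The pipeline sketched in the abstract (per-frame ViT spatial encoding, GRU temporal encoding, cascaded pinyin-then-character decoding) can be illustrated by tracing tensor shapes through each stage. All concrete sizes below (frame count, image size, patch size, embedding and hidden dimensions, vocabulary sizes) are illustrative assumptions, not values reported by the paper:

```python
# Hypothetical shape walk-through of the ViT + GRU lipreading pipeline.
# Sizes are assumptions chosen for illustration only.

def vit_gru_pipeline_shapes(num_frames=75, img=64, patch=16,
                            d_model=512, gru_hidden=256,
                            pinyin_vocab=400, char_vocab=3000):
    """Trace tensor shapes: per-frame ViT -> frame features -> GRU ->
    cascaded seq2seq decoding (pinyin, then characters)."""
    # 1. Each lip frame is split into (img/patch)^2 non-overlapping patches;
    #    a learnable [CLS] token is prepended before self-attention, so the
    #    ViT processes n_patches + 1 tokens per frame.
    n_patches = (img // patch) ** 2
    vit_tokens = (num_frames, n_patches + 1, d_model)
    # 2. The [CLS] embedding of each frame serves as its global spatial
    #    feature, yielding one d_model-dimensional vector per frame.
    frame_features = (num_frames, d_model)
    # 3. A GRU consumes the frame sequence and emits one hidden state per
    #    time step, modelling temporal dependencies across frames.
    gru_outputs = (num_frames, gru_hidden)
    # 4. A first attention-based decoder predicts pinyin token logits; a
    #    second, cascaded decoder conditions on them to predict characters.
    pinyin_logits_per_step = (pinyin_vocab,)
    char_logits_per_step = (char_vocab,)
    return vit_tokens, frame_features, gru_outputs, \
        pinyin_logits_per_step, char_logits_per_step


if __name__ == "__main__":
    for name, shape in zip(
            ["ViT tokens", "frame features", "GRU outputs",
             "pinyin logits/step", "char logits/step"],
            vit_gru_pipeline_shapes()):
        print(f"{name}: {shape}")
```

With a 64×64 frame and 16×16 patches, each frame contributes 16 patch tokens plus the [CLS] token, so the ViT stage sees 17 tokens per frame; only the [CLS] output is kept as the frame's global spatial feature before the GRU.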